[PoC] Reducing planning time on tables with many indexes

    geidav.pg@gmail.com, 2022-07-27T12:42:37+00:00
Hi hackers,

We came across a slowdown in planning for queries on tables with many indexes. In setups with wide tables it is not uncommon to have 10-100 indexes on a single table. The slowdown is already visible in serial workloads with just a handful of indexes, but it is drastically amplified when queries using more indexes run in parallel at high throughput.

We measured the TPS and planning time of running parallel streams of simple point look-up queries on a single empty table with 60 columns and 60 indexes. The query used is 'SELECT * FROM synth_table WHERE col5 = 42'. No rows are returned because the table is empty. We used a machine with 64 physical CPU cores. The schema and the sysbench script to reproduce these numbers are attached. We took the TPS as reported by sysbench and obtained the planning time by running 'EXPLAIN ANALYZE' on the same query in a separately opened connection, averaging the planning time of 3 successive 'EXPLAIN ANALYZE' runs. sysbench ran on the same machine with varying numbers of threads, using the following command line:

sysbench repro.lua --db-driver=pgsql --pgsql-host=localhost --pgsql-db=postgres --pgsql-port=? --pgsql-user=? --pgsql-password=? --report-interval=1 --threads=64 run

The following table shows the results. The TPS flattens out at just 8 parallel streams, while the planning time keeps increasing drastically.

Parallel streams | TPS (before) | Planning time (before)
-----------------|--------------|-----------------------
               1 |        5,486 | 0.13 ms
               2 |        8,098 | 0.22 ms
               4 |       15,013 | 0.19 ms
               8 |       27,107 | 0.29 ms
              16 |       30,938 | 0.43 ms
              32 |       26,330 | 1.68 ms
              64 |       24,314 | 2.48 ms

We tracked down the root cause of this slowdown to lock contention in 'get_relation_info()': the index lock of every single index of every single table used in the query is acquired.
We attempted a fix by pre-filtering out all indexes that cannot possibly be used by a given query, without taking their index locks (credits to Luc Vlaming for the idea and implementation). The patch does so by caching the columns present in every index inside 'struct Relation', similarly to 'rd_indexlist'. Then, before opening (= locking) the indexes in 'get_relation_info()', we check whether an index can actually contribute to the query, and if not, it is discarded right away. Caching the index info saves considerable work for every subsequent run of the query, because fewer indexes must be inspected and thereby locked. This way we also save cycles in any code that later iterates over all of the relation's indexes.

The work-in-progress version of the patch is attached. It is still fairly rough (e.g. it uses a global variable, selects the best index in scans without restrictions by column count instead of physical column size, and is missing some renaming), but it shows the principle.

The following table shows the TPS, planning time, and speed-ups after applying the patch and rerunning the benchmark described above. Now the planning time remains roughly constant, and the TPS roughly doubles each time the number of parallel streams is doubled. The higher the stream count, the more severe the lock contention and the more pronounced the speed-up. Interestingly, even for a single query stream the speed-up in planning time is already very significant. This also applies to lower index counts: for example, with just 10 indexes the TPS for a single query stream goes from 9,159 to 12,558. We can do more measurements if there is interest in details for lower index counts.
Parallel streams | TPS (after) | Planning time (after) | Speed-up TPS | Speed-up planning
-----------------|-------------|-----------------------|--------------|------------------
               1 |      10,344 | 0.046 ms              |         1.9x |  2.8x
               2 |      20,140 | 0.045 ms              |         2.5x |  4.9x
               4 |      40,349 | 0.047 ms              |         2.7x |  4.0x
               8 |      80,121 | 0.046 ms              |         3.0x |  6.3x
              16 |     152,632 | 0.051 ms              |         4.9x |  8.4x
              32 |     301,359 | 0.052 ms              |        11.4x | 32.3x
              64 |     525,115 | 0.062 ms              |        21.6x | 40.0x

We are happy to receive your feedback and polish up the patch.

--
David Geier
(ServiceNow)
      geidav.pg@gmail.com, 2022-07-27T12:46:24+00:00
      Sorry, by accident I sent this one out twice.

      --
      David Geier
      (ServiceNow)